pacman::p_load(tidyverse, sf, httr, jsonlite, rvest, tmap, rsample, GWmodel, SpatialML, ranger)Take-Home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods
1. OVERVIEW
Housing is a key part of wealth for households around the world, with buying a home often being one of the biggest investments people make. Housing prices are influenced by many factors, including broader economic ones like inflation and local factors specific to the property, such as its size or location near services like schools and shopping centers. Traditionally, price prediction models used a method called Ordinary Least Squares (OLS), but this approach doesn’t account for the geographical patterns in housing data, which can lead to inaccurate results. To improve accuracy, Geographically Weighted Models (GWMs) were introduced to better predict housing prices by considering these location-based differences.
2. THE TASK
For this exercise, your task is to build a predictive model to forecast HDB resale prices for the period of July to September 2024, using transaction records from HDB resale data in 2023.
3. THE DATA
For this exercise, you’ll use the HDB Resale Flat Prices dataset from Data.gov.sg as the primary source. The analysis should focus on resale price predictions for either three-room, four-room, or five-room flats. Here are recommended predictors you can use, though you’re welcome to include other relevant variables.
Structural factors: - Area of the unit - Floor level - Remaining lease - Age of the unit - Main Upgrading Program (MUP) completion (optional)
Locational factors: - Proximity to the Central Business District (CBD) - Proximity to eldercare facilities - Proximity to food courts/hawker centers - Proximity to MRT stations - Proximity to parks - Proximity to good primary schools - Proximity to shopping malls - Proximity to supermarkets - Number of kindergartens within 350 meters - Number of childcare centers within 350 meters - Number of bus stops within 350 meters - Number of primary schools within 1 kilometer
These variables provide a foundation for modeling, helping capture both the physical characteristics of the flats and the surrounding amenities that may impact resale prices.
4. GEOSPATIAL DATA WRANGLING
4.1 INSTALLING THE R PACKAGES
This code chunk uses p_load() of the pacman package (stands for Package Manager) to check if the following packages are installed in the computer. The packages will then be launched into R.
This code reads and processes a CSV file containing HDB resale transaction data. It filters the records to include only transactions between January 2023 and September 2024, focusing specifically on three-room flats. It then creates a new address column by combining the block number and street name, and extracts and formats the remaining lease period into years (remaining_lease_yr) and months (remaining_lease_mth). Finally, it displays the processed data for inspection.
4.2 Loading the Data
resale <- read_csv("data/aspatial/resale.csv") %>%
filter(month >= "2023-01" & month <= "2024-09")library(dplyr)
library(stringr)
resale_tidy <- resale %>%
mutate(address = paste(block, street_name)) %>%
mutate(
remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2)),
remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11))
) %>%
filter(flat_type == "5 ROOM")# View the result
print(resale_tidy)# A tibble: 10,886 × 14
month town flat_type block street_name storey_range floor_area_sqm
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2023-01 ANG MO KIO 5 ROOM 306 ANG MO KIO AV… 16 TO 18 123
2 2023-01 ANG MO KIO 5 ROOM 306 ANG MO KIO AV… 04 TO 06 123
3 2023-01 ANG MO KIO 5 ROOM 402 ANG MO KIO AV… 10 TO 12 119
4 2023-01 ANG MO KIO 5 ROOM 259 ANG MO KIO AV… 13 TO 15 135
5 2023-01 ANG MO KIO 5 ROOM 176 ANG MO KIO AV… 04 TO 06 119
6 2023-01 ANG MO KIO 5 ROOM 618 ANG MO KIO AV… 13 TO 15 133
7 2023-01 ANG MO KIO 5 ROOM 520 ANG MO KIO AV… 19 TO 21 118
8 2023-01 ANG MO KIO 5 ROOM 700C ANG MO KIO AV… 19 TO 21 111
9 2023-01 ANG MO KIO 5 ROOM 714 ANG MO KIO AV… 10 TO 12 119
10 2023-01 ANG MO KIO 5 ROOM 253 ANG MO KIO ST… 04 TO 06 128
# ℹ 10,876 more rows
# ℹ 7 more variables: flat_model <chr>, lease_commence_date <dbl>,
# remaining_lease <chr>, resale_price <dbl>, address <chr>,
# remaining_lease_yr <int>, remaining_lease_mth <int>
4.3 Getting Coordinates
This code retrieves and processes geographical coordinates for a list of addresses. It begins by defining the get_coords function, which takes a list of addresses (add_list) and uses the OneMap API to obtain postal codes, latitudes, and longitudes for each address. The function creates a postal_coords data frame to store these results. For each address, if only one matching record is found, its coordinates are added. If multiple results are found, entries with “NIL” postal codes are ignored, and the first valid result is used. If no results are found, the coordinates are set to NA.
Once the coordinates are collected, they are saved as an .rds file. The code then reads this file, converts the data to a spatial (sf) object, specifying the latitude and longitude columns, and transforms it into the SVY21 (EPSG:3414) coordinate system, commonly used in Singapore. Finally, the structure of the new sf object is displayed.
resale_selected <- resale_tidy %>%
filter(month == "2024-09")add_list <- sort(unique(resale_selected$address))get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
}
else {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}coords <- get_coords(add_list)write_rds(coords, "data/geospatial/coords.rds")# Load the spatial data (coords) from an .rds file
coords <- readRDS("data/geospatial/coords.rds")This code performs spatial data processing and mapping with sf and tmap packages. Here’s a summary of each step:
Joining Datasets: It starts by performing a left join between
coords_sf(spatial coordinates of addresses) andresale_tidy(resale transaction data), usingaddressas the common column. This merged data is expected to retain thesfgeometry.Loading and Converting Geospatial Data: The
mpszdataset, containing Singapore’s subzone boundaries, is loaded and converted into ansfobject. It’s reprojected to Singapore’s SVY21 coordinate system (EPSG:3414) for accurate spatial analysis.Mapping Subzones: A basic map of the
mpszdataset is created usingtmap, displaying polygons representing Singapore’s subzones.Handling MULTIPOLYGONS: The code then:
- Converts
MULTIPOLYGONgeometries inmpszto individualPOLYGONgeometries. - Calculates the area of each polygon.
- Selects the largest polygon within each subzone, identified by
SUBZONE_N, retaining only the largest area to simplify further analysis.
- Converts
Filtering by Spatial Join: The
joined_datadataset is filtered to include only points within the largest polygons of each subzone, usingst_within()to spatially subset the points that fall inside the boundaries.Final Map Display: Finally, a map of the
final_datasetis displayed with points (usingtm_dots()), representing the filtered HDB resale transactions overlaid on the subzones map.
This approach ensures that only transactions within the primary subzone boundaries are included for analysis.
library(sf)
# Convert coords to an sf object, specifying latitude first, then longitude
coords_sf <- st_as_sf(coords, coords = c("longitude", "latitude"), crs = 4326)
# Convert to SVY21 (EPSG:3414) if needed
coords_sf <- st_transform(coords_sf, crs = 3414)
# Check the structure of the new sf object
print(coords_sf)Simple feature collection with 487 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal geometry
1 101 SIMEI ST 1 520101 POINT (41144.95 35890.2)
2 102 TECK WHYE LANE 680102 POINT (19216.72 39896.49)
3 102A PUNGGOL FIELD 821102 POINT (36064.38 42301.51)
4 103 CLEMENTI ST 14 120103 POINT (20861.59 33870.73)
5 106 TAMPINES ST 11 521106 POINT (40726.66 36629.85)
6 106A BIDADARI PK DR 341106 POINT (32222.45 35147.86)
7 106A PUNGGOL FIELD 821106 POINT (36349.34 42055.12)
8 107 JLN RAJAH 320107 POINT (30412.22 34452.8)
9 107A EDGEFIELD PLAINS 821107 POINT (36180.37 42120.79)
10 108A CANBERRA WALK 751108 POINT (27772.55 47711.53)
4.4 Filtering Using Boundaries
# Perform the left join on the common column, 'address'
joined_data <- coords_sf %>%
left_join(resale_tidy, by = "address")
# Check the class to confirm it's still an sf object with geometry retained
print(class(joined_data)) # Should return `sf` if geometry is intact[1] "sf" "data.frame"
mpsz <- read_rds("data/geospatial/mpsz.rds")
mpszSimple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
SUBZONE_N SUBZONE_C PLN_AREA_N PLN_AREA_C REGION_N
1 MARINA EAST MESZ01 MARINA EAST ME CENTRAL REGION
2 INSTITUTION HILL RVSZ05 RIVER VALLEY RV CENTRAL REGION
3 ROBERTSON QUAY SRSZ01 SINGAPORE RIVER SR CENTRAL REGION
4 JURONG ISLAND AND BUKOM WISZ01 WESTERN ISLANDS WI WEST REGION
5 FORT CANNING MUSZ02 MUSEUM MU CENTRAL REGION
6 MARINA EAST (MP) MPSZ05 MARINE PARADE MP CENTRAL REGION
7 SUDONG WISZ03 WESTERN ISLANDS WI WEST REGION
8 SEMAKAU WISZ02 WESTERN ISLANDS WI WEST REGION
9 SOUTHERN GROUP SISZ02 SOUTHERN ISLANDS SI CENTRAL REGION
10 SENTOSA SISZ01 SOUTHERN ISLANDS SI CENTRAL REGION
REGION_C geometry
1 CR MULTIPOLYGON (((33222.98 29...
2 CR MULTIPOLYGON (((28481.45 30...
3 CR MULTIPOLYGON (((28087.34 30...
4 WR MULTIPOLYGON (((14557.7 304...
5 CR MULTIPOLYGON (((29542.53 31...
6 CR MULTIPOLYGON (((35279.55 30...
7 WR MULTIPOLYGON (((15772.59 21...
8 WR MULTIPOLYGON (((19843.41 21...
9 CR MULTIPOLYGON (((30870.53 22...
10 CR MULTIPOLYGON (((26879.04 26...
mpsz_sf <- st_as_sf(mpsz, mpsz = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs = 3414)library(tmap)
# Set tmap to static mode
tmap_mode("plot")
# Create a basic map with `mpsz`
tm_shape(mpsz_sf) +
tm_polygons()
library(sf)
library(dplyr)
# Step 1: Convert MULTIPOLYGON geometries to individual POLYGON geometries
mpsz_polygons <- st_cast(mpsz, "POLYGON")
# Step 2: Calculate the area for each POLYGON
mpsz_polygons <- mpsz_polygons %>%
mutate(area = st_area(geometry))
# Step 3: Group by original MULTIPOLYGON identifier and keep the largest POLYGON by area
# Replace `unique_id` with the column that uniquely identifies each MULTIPOLYGON
mpsz_largest_polygons <- mpsz_polygons %>%
group_by(SUBZONE_N) %>%
filter(area == max(area)) %>%
ungroup()
# Print the result to verify
print(mpsz_largest_polygons)Simple feature collection with 332 features and 7 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 19021.55 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
# A tibble: 332 × 8
SUBZONE_N SUBZONE_C PLN_AREA_N PLN_AREA_C REGION_N REGION_C
<chr> <chr> <chr> <chr> <chr> <chr>
1 MARINA EAST MESZ01 MARINA EAST ME CENTRAL… CR
2 INSTITUTION HILL RVSZ05 RIVER VALLEY RV CENTRAL… CR
3 ROBERTSON QUAY SRSZ01 SINGAPORE RIV… SR CENTRAL… CR
4 JURONG ISLAND AND BUKOM WISZ01 WESTERN ISLAN… WI WEST RE… WR
5 FORT CANNING MUSZ02 MUSEUM MU CENTRAL… CR
6 MARINA EAST (MP) MPSZ05 MARINE PARADE MP CENTRAL… CR
7 SUDONG WISZ03 WESTERN ISLAN… WI WEST RE… WR
8 SEMAKAU WISZ02 WESTERN ISLAN… WI WEST RE… WR
9 SOUTHERN GROUP SISZ02 SOUTHERN ISLA… SI CENTRAL… CR
10 SENTOSA SISZ01 SOUTHERN ISLA… SI CENTRAL… CR
# ℹ 322 more rows
# ℹ 2 more variables: geometry <POLYGON [m]>, area [m^2]
library(tmap)
# Set tmap to static mode
tmap_mode("plot")
# Create a basic map with `mpsz`
tm_shape(mpsz_largest_polygons) +
tm_polygons()
final_dataset <- joined_data[st_within(joined_data, mpsz_largest_polygons, sparse = FALSE), ]final_dataset <- st_as_sf(mpsz, mpsz = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs = 3414)library(tmap)
# Set tmap to static mode
tmap_mode("plot")
# Create a basic map with `mpsz`
tm_shape(final_dataset) +
tm_dots()
This code refines spatial data processing and visualization by ensuring consistent coordinate reference systems (CRS) for accurate spatial joins and mapping.
CRS Alignment: First, both
joined_data(the points data) andmpsz_largest_polygons(the largest subzone polygons) are transformed to the SVY21 coordinate system (EPSG:3414) for accurate spatial alignment.Spatial Join: A spatial join is performed using
st_joinwith thest_withinjoin type. This operation links each point injoined_datato its containing polygon inmpsz_largest_polygons, effectively adding polygon attributes to each point for further analysis.Verification: The resulting
final_datasetis printed to verify that the join has successfully added polygon attributes to each point.Mapping in Static Mode: The
tmaplibrary is used to create a static map showing both subzone polygons and points. The polygons are displayed in light blue with a dark blue border, while the points are plotted in red, giving a clear view of which points fall within each subzone.Dynamic Map in WGS84 and SVY21: The code then reprojects
joined_databetween WGS84 (EPSG:4326) and SVY21 (EPSG:3414) to provide flexibility in viewing modes. Usingtmap_mode("view"), it displays a dynamic, interactive map ofjoined_data, allowing for exploration of the spatial data.
This approach confirms the spatial relationships between points and polygons, facilitating further geospatial analysis and visualization.
library(sf)
library(dplyr)
# Step 1: Make sure both datasets are in the same CRS
# Assuming you want to use EPSG:3414 (SVY21)
joined_data <- st_transform(joined_data, crs = 3414)
mpsz_largest_polygons <- st_transform(mpsz_largest_polygons, crs = 3414)
# Step 2: Perform a spatial join to combine points with their respective boundaries
# This will add the polygon attributes to each point in joined_data
final_dataset <- st_join(joined_data, mpsz_largest_polygons, join = st_within)
# Step 3: Verify the results
print(final_dataset)Simple feature collection with 2052 features and 22 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
SUBZONE_N SUBZONE_C PLN_AREA_N PLN_AREA_C REGION_N REGION_C
1 SIMEI TMSZ04 TAMPINES TM EAST REGION ER
2 SIMEI TMSZ04 TAMPINES TM EAST REGION ER
3 TECK WHYE CKSZ01 CHOA CHU KANG CK WEST REGION WR
4 TECK WHYE CKSZ01 CHOA CHU KANG CK WEST REGION WR
5 PUNGGOL FIELD PGSZ04 PUNGGOL PG NORTH-EAST REGION NER
6 PUNGGOL FIELD PGSZ04 PUNGGOL PG NORTH-EAST REGION NER
7 SUNSET WAY CLSZ02 CLEMENTI CL WEST REGION WR
8 SUNSET WAY CLSZ02 CLEMENTI CL WEST REGION WR
9 SUNSET WAY CLSZ02 CLEMENTI CL WEST REGION WR
10 SUNSET WAY CLSZ02 CLEMENTI CL WEST REGION WR
area geometry
1 2777496.6 [m^2] POINT (41144.95 35890.2)
2 2777496.6 [m^2] POINT (41144.95 35890.2)
3 1003202.6 [m^2] POINT (19216.72 39896.49)
4 1003202.6 [m^2] POINT (19216.72 39896.49)
5 1378710.5 [m^2] POINT (36064.38 42301.51)
6 1378710.5 [m^2] POINT (36064.38 42301.51)
7 963924.6 [m^2] POINT (20861.59 33870.73)
8 963924.6 [m^2] POINT (20861.59 33870.73)
9 963924.6 [m^2] POINT (20861.59 33870.73)
10 963924.6 [m^2] POINT (20861.59 33870.73)
library(tmap)
# Set tmap to plot mode for a static map
tmap_mode("plot")
# Plot both polygons and points
tm_shape(mpsz_largest_polygons) +
tm_polygons(alpha = 0.3, col = "lightblue", border.col = "darkblue") +
tm_shape(final_dataset) +
tm_dots(col = "red", size = 0.2) +
tm_layout(main.title = "Points within Boundaries", legend.outside = TRUE)
# Transform joined_data to WGS84 coordinate system (EPSG:4326)
joined_data <- st_transform(joined_data, crs = st_crs(4326))
# Transform joined_data to SVY21 coordinate system (EPSG:3414)
joined_data <- st_transform(joined_data, crs = st_crs(3414))tmap_mode("view")
tm_shape(joined_data)+
tm_dots()ADDING DERIVED VARIABLES
You loaded eldercare facility data from Data.gov.sg, transformed it to match the coordinate system of your HDB resale dataset, and then calculated the distance from each HDB resale location to the nearest eldercare facility. This distance was added as a new variable in the dataset to analyze accessibility to eldercare services for each resale location.
eldercare <- st_read(dsn = "data/geospatial/EldercareServicesSHP", layer = "ELDERCARE")Reading layer `ELDERCARE' from data source
`C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\EldercareServicesSHP'
using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
eldercare <- st_transform(eldercare, crs = 3414)library(sf)
library(dplyr)
# Ensure both datasets are in the same CRS
joined_data <- st_transform(joined_data, crs = 3414)
eldercare <- st_transform(eldercare, crs = 3414)
# Step 1: Find the nearest eldercare point for each HDB point
nearest_eldercare_index <- st_nearest_feature(joined_data, eldercare)
# Step 2: Calculate the distance to the nearest eldercare point
joined_data <- joined_data %>%
mutate(distance_to_eldercare = as.numeric(st_distance(geometry, eldercare[nearest_eldercare_index, ], by_element = TRUE)))
# Check the updated dataset
print(joined_data)Simple feature collection with 2052 features and 16 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
geometry distance_to_eldercare
1 POINT (41144.95 35890.2) 758.1380
2 POINT (41144.95 35890.2) 758.1380
3 POINT (19216.72 39896.49) 327.6085
4 POINT (19216.72 39896.49) 327.6085
5 POINT (36064.38 42301.51) 701.7365
6 POINT (36064.38 42301.51) 701.7365
7 POINT (20861.59 33870.73) 378.3587
8 POINT (20861.59 33870.73) 378.3587
9 POINT (20861.59 33870.73) 378.3587
10 POINT (20861.59 33870.73) 378.3587
This code calculates the distance from each HDB resale location to Raffles Place, which represents Singapore’s Central Business District (CBD). First, the coordinates for Raffles Place (103.8514, 1.2839 in WGS84) are defined and then transformed to match the CRS of the HDB dataset (EPSG:3414). Using st_distance, the distance from each HDB point to the CBD is computed and added to joined_data as a new column, distance_to_cbd, to assess proximity to the CBD for each resale property.
library(sf)
library(dplyr)
# Ensure joined_data is in the projected CRS (e.g., EPSG:3414)
joined_data <- st_transform(joined_data, crs = 3414)
# Define the CBD point (Raffles Place coordinates in WGS84)
cbd_point <- st_sfc(st_point(c(103.8514, 1.2839)), crs = 4326) # WGS84
# Transform CBD point to the same CRS as joined_data (EPSG:3414)
cbd_point <- st_transform(cbd_point, crs = 3414)
# Calculate the distance from each HDB point to the CBD point
joined_data <- joined_data %>%
mutate(distance_to_cbd = as.numeric(st_distance(geometry, cbd_point)[, 1]))
# View the updated dataset with distance to CBD
print(joined_data)Simple feature collection with 2052 features and 17 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 758.1380 12790.41
2 POINT (41144.95 35890.2) 758.1380 12790.41
3 POINT (19216.72 39896.49) 327.6085 14923.59
4 POINT (19216.72 39896.49) 327.6085 14923.59
5 POINT (36064.38 42301.51) 701.7365 14076.30
6 POINT (36064.38 42301.51) 701.7365 14076.30
7 POINT (20861.59 33870.73) 378.3587 10101.36
8 POINT (20861.59 33870.73) 378.3587 10101.36
9 POINT (20861.59 33870.73) 378.3587 10101.36
10 POINT (20861.59 33870.73) 378.3587 10101.36
This code calculates the distance from each HDB resale location to the nearest hawker center. It first ensures that both joined_data (HDB resale data) and hawker (hawker center locations) are in the same coordinate reference system (EPSG:3414). Then, using st_nearest_feature, it identifies the index of the closest hawker center for each HDB location. The st_distance function calculates the distance from each HDB point to its nearest hawker center, adding this as a new column, distance_to_hawker, in the joined_data dataset. This additional variable enables analysis of proximity to hawker centers for each resale property.
hawker <- st_read("data/geospatial/HawkerCentresKML.kml")Reading layer `HAWKERCENTRE' from data source
`C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\HawkerCentresKML.kml'
using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
library(sf)
library(dplyr)
# Ensure both datasets are in the same CRS
joined_data <- st_transform(joined_data, crs = 3414)
hawker <- st_transform(hawker, crs = 3414) # Assuming hawker dataset is in sf format
# Step 1: Find the index of the nearest hawker center for each HDB point
nearest_hawker_index <- st_nearest_feature(joined_data, hawker)
# Step 2: Calculate the distance to the nearest hawker center
joined_data <- joined_data %>%
mutate(distance_to_hawker = as.numeric(st_distance(geometry, hawker[nearest_hawker_index, ], by_element = TRUE)))
# View the updated dataset with the new distance column
print(joined_data)Simple feature collection with 2052 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 758.1380 12790.41
2 POINT (41144.95 35890.2) 758.1380 12790.41
3 POINT (19216.72 39896.49) 327.6085 14923.59
4 POINT (19216.72 39896.49) 327.6085 14923.59
5 POINT (36064.38 42301.51) 701.7365 14076.30
6 POINT (36064.38 42301.51) 701.7365 14076.30
7 POINT (20861.59 33870.73) 378.3587 10101.36
8 POINT (20861.59 33870.73) 378.3587 10101.36
9 POINT (20861.59 33870.73) 378.3587 10101.36
10 POINT (20861.59 33870.73) 378.3587 10101.36
distance_to_hawker
1 924.5228
2 924.5228
3 1333.7754
4 1333.7754
5 1040.3345
6 1040.3345
7 927.6112
8 927.6112
9 927.6112
10 927.6112
This code loads shopping mall location data from a CSV file (sourced from Kaggle), converts it into a spatial sf object using the latitude and longitude columns, and sets the coordinate reference system to WGS84 (EPSG:4326). Following the same approach as with the eldercare and hawker center data, it ensures both the HDB dataset and shopping mall locations are in the same coordinate system, enabling spatial analysis. Using spatial functions, it then calculates the distance from each HDB resale location to the nearest shopping mall, adding this as a new column to the HDB dataset for proximity analysis.
library(sf)
library(dplyr)
# Step 1: Read the CSV file (replace 'malls.csv' with the path to your file)
malls_data <- read.csv("data/aspatial/shopping_mall_coordinates.csv")
# Step 2: Convert to sf object using latitude and longitude columns
# Assuming the latitude and longitude columns are named "lat" and "long"
malls_sf <- malls_data %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326) # WGS84 CRS
# View the resulting sf object to confirm it has geometry
print(malls_sf)Simple feature collection with 155 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 103.6972 ymin: 1.274588 xmax: 103.99 ymax: 1.448231
Geodetic CRS: WGS 84
First 10 features:
Mall.Name geometry
1 100 AM POINT (103.8435 1.274588)
2 313@Somerset POINT (103.8384 1.301014)
3 Aperia POINT (103.8643 1.310474)
4 Balestier Hill Shopping Centre POINT (103.8426 1.325596)
5 Bugis Cube POINT (103.8556 1.298141)
6 Bugis Junction POINT (103.8554 1.299113)
7 Bugis+ POINT (103.8552 1.300952)
8 Capitol Piazza POINT (103.8513 1.293079)
9 Cathay Cineleisure Orchard POINT (103.8365 1.301464)
10 The Centrepoint POINT (103.84 1.30145)
malls_sf <- st_transform(malls_sf, crs = 3414)library(sf)
library(dplyr)
# Ensure both datasets are in the same CRS (e.g., EPSG:3414 for SVY21 in Singapore)
joined_data <- st_transform(joined_data, crs = 3414)
malls_sf <- st_transform(malls_sf, crs = 3414)
# Step 1: Find the index of the nearest mall for each HDB point
nearest_mall_index <- st_nearest_feature(joined_data, malls_sf)
# Step 2: Calculate the distance to the nearest mall
joined_data <- joined_data %>%
mutate(distance_to_mall = as.numeric(st_distance(geometry, malls_sf[nearest_mall_index, ], by_element = TRUE)))
# View the updated dataset with the new distance column
print(joined_data)Simple feature collection with 2052 features and 19 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 758.1380 12790.41
2 POINT (41144.95 35890.2) 758.1380 12790.41
3 POINT (19216.72 39896.49) 327.6085 14923.59
4 POINT (19216.72 39896.49) 327.6085 14923.59
5 POINT (36064.38 42301.51) 701.7365 14076.30
6 POINT (36064.38 42301.51) 701.7365 14076.30
7 POINT (20861.59 33870.73) 378.3587 10101.36
8 POINT (20861.59 33870.73) 378.3587 10101.36
9 POINT (20861.59 33870.73) 378.3587 10101.36
10 POINT (20861.59 33870.73) 378.3587 10101.36
distance_to_hawker distance_to_mall
1 924.5228 271.0456
2 924.5228 271.0456
3 1333.7754 390.7374
4 1333.7754 390.7374
5 1040.3345 748.0958
6 1040.3345 748.0958
7 927.6112 1001.6526
8 927.6112 1001.6526
9 927.6112 1001.6526
10 927.6112 1001.6526
This code processes train station data from a KML file, converting it to a spatial format and setting the coordinate reference system to EPSG:3414 (SVY21) for Singapore. It begins by checking for and identifying any invalid geometries in train_stations, printing them for review. After confirming both train_stations and joined_data (HDB resale data) are in the same CRS, the code calculates the distance from each HDB resale location to the nearest train station using spatial functions. This distance is added as a new column, distance_to_mrt, in the joined_data dataset, providing a measure of proximity to MRT stations for further analysis.
train_stations <- st_read(dsn = "data/geospatial/LTAMRTStationExitKML.kml")Reading layer `MRT_EXITS' from data source
`C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\LTAMRTStationExitKML.kml'
using driver `KML'
Simple feature collection with 563 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6368 ymin: 1.264972 xmax: 103.9893 ymax: 1.449157
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
train_stations <- st_transform(train_stations, crs = 3414)library(sf)
library(dplyr)
# Ensure train_stations is in the desired CRS (e.g., EPSG:3414)
train_stations <- st_transform(train_stations, crs = 3414)
# Step 1: Identify invalid polygons
invalid_polygons <- train_stations %>%
filter(!st_is_valid(geometry))
# Step 2: View invalid polygons
print(invalid_polygons)Simple feature collection with 0 features and 2 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] Name Description geometry
<0 rows> (or 0-length row.names)
# Ensure both datasets are in the same CRS (e.g., EPSG:3414 for SVY21 in Singapore)
joined_data <- st_transform(joined_data, crs = 3414)
train_sf <- st_transform(train_stations, crs = 3414)
# Step 1: Find the index of the nearest mall for each HDB point
nearest_mrt_index <- st_nearest_feature(joined_data, train_sf)
# Step 2: Calculate the distance to the nearest mall
joined_data <- joined_data %>%
mutate(distance_to_mrt = as.numeric(st_distance(geometry, train_sf[nearest_mrt_index, ], by_element = TRUE)))
# View the updated dataset with the new distance column
print(joined_data)Simple feature collection with 2052 features and 20 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 758.1380 12790.41
2 POINT (41144.95 35890.2) 758.1380 12790.41
3 POINT (19216.72 39896.49) 327.6085 14923.59
4 POINT (19216.72 39896.49) 327.6085 14923.59
5 POINT (36064.38 42301.51) 701.7365 14076.30
6 POINT (36064.38 42301.51) 701.7365 14076.30
7 POINT (20861.59 33870.73) 378.3587 10101.36
8 POINT (20861.59 33870.73) 378.3587 10101.36
9 POINT (20861.59 33870.73) 378.3587 10101.36
10 POINT (20861.59 33870.73) 378.3587 10101.36
distance_to_hawker distance_to_mall distance_to_mrt
1 924.5228 271.0456 332.54769
2 924.5228 271.0456 332.54769
3 1333.7754 390.7374 85.37721
4 1333.7754 390.7374 85.37721
5 1040.3345 748.0958 36.02671
6 1040.3345 748.0958 36.02671
7 927.6112 1001.6526 903.86491
8 927.6112 1001.6526 903.86491
9 927.6112 1001.6526 903.86491
10 927.6112 1001.6526 903.86491
This code loads park data from a KML file, converts it to an sf spatial format, and sets the coordinate reference system to EPSG:3414 (SVY21) for consistency with the HDB resale data. After ensuring that both joined_data (HDB data) and parks are in the same CRS, it calculates the distance from each HDB resale location to the nearest park. This distance is added as a new column, distance_to_parks, in the joined_data dataset, allowing for analysis of proximity to parks for each resale location.
parks <- st_read(dsn = "data/geospatial/Parks.kml")Reading layer `NATIONALPARKS' from data source
`C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\Parks.kml'
using driver `KML'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
parks<- st_transform(parks, crs = 3414)# Ensure both datasets are in the same CRS (e.g., EPSG:3414 for SVY21 in Singapore)
joined_data <- st_transform(joined_data, crs = 3414)
parks_sf <- st_transform(parks, crs = 3414)
# Step 1: Find the index of the nearest mall for each HDB point
nearest_parks_index <- st_nearest_feature(joined_data, parks_sf)
# Step 2: Calculate the distance to the nearest mall
joined_data <- joined_data %>%
mutate(distance_to_parks = as.numeric(st_distance(geometry, parks_sf[nearest_parks_index, ], by_element = TRUE)))
# View the updated dataset with the new distance column
print(joined_data)Simple feature collection with 2052 features and 21 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 758.1380 12790.41
2 POINT (41144.95 35890.2) 758.1380 12790.41
3 POINT (19216.72 39896.49) 327.6085 14923.59
4 POINT (19216.72 39896.49) 327.6085 14923.59
5 POINT (36064.38 42301.51) 701.7365 14076.30
6 POINT (36064.38 42301.51) 701.7365 14076.30
7 POINT (20861.59 33870.73) 378.3587 10101.36
8 POINT (20861.59 33870.73) 378.3587 10101.36
9 POINT (20861.59 33870.73) 378.3587 10101.36
10 POINT (20861.59 33870.73) 378.3587 10101.36
distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1 924.5228 271.0456 332.54769 256.4781
2 924.5228 271.0456 332.54769 256.4781
3 1333.7754 390.7374 85.37721 341.9659
4 1333.7754 390.7374 85.37721 341.9659
5 1040.3345 748.0958 36.02671 828.0045
6 1040.3345 748.0958 36.02671 828.0045
7 927.6112 1001.6526 903.86491 459.1670
8 927.6112 1001.6526 903.86491 459.1670
9 927.6112 1001.6526 903.86491 459.1670
10 927.6112 1001.6526 903.86491 459.1670
This code calculates the number of preschools and schools within a 400-meter radius of each HDB resale location. It first loads and transforms the preschool and school zone data into the same coordinate reference system (EPSG:3414) as the HDB data in joined_data. Then, it creates a 400-meter buffer around each HDB point to represent a walkable area. Using st_intersects, it identifies which preschools and schools fall within each buffer and counts them. These counts are added to joined_data as new columns, preschools_within_400m and schools_within_400m, to support analysis of nearby educational facilities for each HDB resale property.
preschools <- st_read("data/geospatial/PreSchoolsLocation.kml")Reading layer `PRESCHOOLS_LOCATION' from data source
`C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\PreSchoolsLocation.kml'
using driver `KML'
Simple feature collection with 2290 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6878 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
school_zones <- st_read(dsn = "data/geospatial/SchoolZone_Jan2024/SchoolZone_Jan2024", layer="SCHOOLZONE") Reading layer `SCHOOLZONE' from data source
`C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\SchoolZone_Jan2024\SchoolZone_Jan2024'
using driver `ESRI Shapefile'
Simple feature collection with 211 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 11717.22 ymin: 28358.21 xmax: 42850.04 ymax: 48798.1
Projected CRS: SVY21
school_zones <- st_transform(school_zones, crs = 3414)# Ensure both datasets (HDB points and preschools/schools) are in the same CRS
joined_data <- st_transform(joined_data, crs = 3414) # Assuming HDB data is in joined_data
preschools <- st_transform(preschools, crs = 3414) # Assuming preschools is an sf object
schools <- st_transform(school_zones, crs = 3414) # Assuming schools is an sf object
# Create a 500m buffer around each HDB point
hdb_buffers <- st_buffer(joined_data, dist = 400)# Step 2: Count preschools within each HDB buffer
# Use st_intersects to identify preschools within each buffer
preschools_within_buffer <- st_intersects(hdb_buffers, preschools)
joined_data$preschools_within_400m <- sapply(preschools_within_buffer, length)
# Step 3: Count schools within each HDB buffer
schools_within_buffer <- st_intersects(hdb_buffers, schools)
joined_data$schools_within_400m <- sapply(schools_within_buffer, length)
# View the updated dataset to check counts
print(joined_data)Simple feature collection with 2052 features and 23 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
7 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
8 103 CLEMENTI ST 14 120103 2023-04 CLEMENTI 5 ROOM 103
9 103 CLEMENTI ST 14 120103 2023-07 CLEMENTI 5 ROOM 103
10 103 CLEMENTI ST 14 120103 2024-07 CLEMENTI 5 ROOM 103
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
7 CLEMENTI ST 14 10 TO 12 121 Improved 1984
8 CLEMENTI ST 14 07 TO 09 123 Improved 1984
9 CLEMENTI ST 14 04 TO 06 121 Improved 1984
10 CLEMENTI ST 14 13 TO 15 123 Improved 1984
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
7 60 years 08 months 808800 60 8
8 60 years 08 months 800000 60 8
9 60 years 05 months 758000 60 5
10 59 years 06 months 500000 59 6
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 758.1380 12790.41
2 POINT (41144.95 35890.2) 758.1380 12790.41
3 POINT (19216.72 39896.49) 327.6085 14923.59
4 POINT (19216.72 39896.49) 327.6085 14923.59
5 POINT (36064.38 42301.51) 701.7365 14076.30
6 POINT (36064.38 42301.51) 701.7365 14076.30
7 POINT (20861.59 33870.73) 378.3587 10101.36
8 POINT (20861.59 33870.73) 378.3587 10101.36
9 POINT (20861.59 33870.73) 378.3587 10101.36
10 POINT (20861.59 33870.73) 378.3587 10101.36
distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1 924.5228 271.0456 332.54769 256.4781
2 924.5228 271.0456 332.54769 256.4781
3 1333.7754 390.7374 85.37721 341.9659
4 1333.7754 390.7374 85.37721 341.9659
5 1040.3345 748.0958 36.02671 828.0045
6 1040.3345 748.0958 36.02671 828.0045
7 927.6112 1001.6526 903.86491 459.1670
8 927.6112 1001.6526 903.86491 459.1670
9 927.6112 1001.6526 903.86491 459.1670
10 927.6112 1001.6526 903.86491 459.1670
preschools_within_400m schools_within_400m
1 5 1
2 5 1
3 7 0
4 7 0
5 10 1
6 10 1
7 3 0
8 3 0
9 3 0
10 3 0
write_rds(joined_data, "data/rds/dataset_no_log_transformed.rds") dataset_no_log_transformed <- read_rds("data/rds/dataset_no_log_transformed.rds")This code converts the distance variables in joined_data from meters to kilometers for easier interpretation. It divides each distance column—distance_to_eldercare, distance_to_cbd, distance_to_hawker, distance_to_mall, distance_to_mrt, and distance_to_parks—by 1,000, updating these values to represent distances in kilometers. This transformation simplifies analysis and comparison across distance-related metrics.
library(dplyr)
# Convert distance variables from meters to kilometers
joined_data <- joined_data %>%
mutate(
distance_to_eldercare = distance_to_eldercare / 1000,
distance_to_cbd = distance_to_cbd / 1000,
distance_to_hawker = distance_to_hawker / 1000,
distance_to_mall = distance_to_mall / 1000,
distance_to_mrt = distance_to_mrt / 1000,
distance_to_parks = distance_to_parks / 1000
)DATASET VARIABLES
In this dataset, several derived variables have been created from the original data, representing factors that could be relevant to the predictive model:
Total Remaining Lease (in months): A new variable,
total_remaining_lease_mths, calculates the total remaining lease time by converting years and months into a single metric in months. This helps standardize the remaining lease duration for analysis.Years Since Lease Commencement: The
years_since_lease_commencevariable calculates the age of the lease from its commencement up to the year 2024, providing insight into the property’s age.Median Storey: For properties with a storey range, the
median_storeyvariable estimates the average storey level by taking the midpoint between the minimum and maximum storeys in the range, giving a useful measure of building height.
These derived variables enhance the dataset by capturing relevant factors that can potentially impact property value in the predictive model.
library(dplyr)
# Add a new column for total remaining lease in months
joined_data <- joined_data %>%
mutate(total_remaining_lease_mths = (remaining_lease_yr * 12) + remaining_lease_mth)
# View the updated dataset to confirm the new column
head(joined_data)Simple feature collection with 6 features and 24 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 19216.72 ymin: 35890.2 xmax: 41144.95 ymax: 42301.51
Projected CRS: SVY21 / Singapore TM
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 0.7581380 12.79041
2 POINT (41144.95 35890.2) 0.7581380 12.79041
3 POINT (19216.72 39896.49) 0.3276085 14.92359
4 POINT (19216.72 39896.49) 0.3276085 14.92359
5 POINT (36064.38 42301.51) 0.7017365 14.07630
6 POINT (36064.38 42301.51) 0.7017365 14.07630
distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1 0.9245228 0.2710456 0.33254769 0.2564781
2 0.9245228 0.2710456 0.33254769 0.2564781
3 1.3337754 0.3907374 0.08537721 0.3419659
4 1.3337754 0.3907374 0.08537721 0.3419659
5 1.0403345 0.7480958 0.03602671 0.8280045
6 1.0403345 0.7480958 0.03602671 0.8280045
preschools_within_400m schools_within_400m total_remaining_lease_mths
1 5 1 771
2 5 1 757
3 7 0 779
4 7 0 764
5 10 1 928
6 10 1 927
library(dplyr)
# Add a new column calculating the age of the lease based on 2024
joined_data <- joined_data %>%
mutate(years_since_lease_commence = 2024 - lease_commence_date)
# View the updated dataset
head(joined_data)Simple feature collection with 6 features and 25 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 19216.72 ymin: 35890.2 xmax: 41144.95 ymax: 42301.51
Projected CRS: SVY21 / Singapore TM
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 0.7581380 12.79041
2 POINT (41144.95 35890.2) 0.7581380 12.79041
3 POINT (19216.72 39896.49) 0.3276085 14.92359
4 POINT (19216.72 39896.49) 0.3276085 14.92359
5 POINT (36064.38 42301.51) 0.7017365 14.07630
6 POINT (36064.38 42301.51) 0.7017365 14.07630
distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1 0.9245228 0.2710456 0.33254769 0.2564781
2 0.9245228 0.2710456 0.33254769 0.2564781
3 1.3337754 0.3907374 0.08537721 0.3419659
4 1.3337754 0.3907374 0.08537721 0.3419659
5 1.0403345 0.7480958 0.03602671 0.8280045
6 1.0403345 0.7480958 0.03602671 0.8280045
preschools_within_400m schools_within_400m total_remaining_lease_mths
1 5 1 771
2 5 1 757
3 7 0 779
4 7 0 764
5 10 1 928
6 10 1 927
years_since_lease_commence
1 36
2 36
3 35
4 35
5 22
6 22
library(dplyr)
library(stringr)
# Calculate the median storey for each range
joined_data <- joined_data %>%
mutate(
# Extract the lower and upper limits of the storey range
storey_min = as.numeric(str_extract(storey_range, "^\\d+")),
storey_max = as.numeric(str_extract(storey_range, "\\d+$")),
# Calculate the median storey as the average of min and max
median_storey = (storey_min + storey_max) / 2
)
# View the updated dataset
head(joined_data)Simple feature collection with 6 features and 28 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 19216.72 ymin: 35890.2 xmax: 41144.95 ymax: 42301.51
Projected CRS: SVY21 / Singapore TM
address postal month town flat_type block
1 101 SIMEI ST 1 520101 2023-07 TAMPINES 5 ROOM 101
2 101 SIMEI ST 1 520101 2024-09 TAMPINES 5 ROOM 101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG 5 ROOM 102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG 5 ROOM 102
5 102A PUNGGOL FIELD 821102 2024-08 PUNGGOL 5 ROOM 102A
6 102A PUNGGOL FIELD 821102 2024-09 PUNGGOL 5 ROOM 102A
street_name storey_range floor_area_sqm flat_model lease_commence_date
1 SIMEI ST 1 01 TO 03 128 Improved 1988
2 SIMEI ST 1 01 TO 03 122 Improved 1988
3 TECK WHYE LANE 01 TO 03 121 Improved 1989
4 TECK WHYE LANE 01 TO 03 121 Improved 1989
5 PUNGGOL FIELD 01 TO 03 110 Improved 2002
6 PUNGGOL FIELD 16 TO 18 110 Improved 2002
remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months 680000 64 3
2 63 years 01 month 715000 63 1
3 64 years 11 months 520000 64 11
4 63 years 08 months 575000 63 8
5 77 years 04 months 603888 77 4
6 77 years 03 months 655000 77 3
geometry distance_to_eldercare distance_to_cbd
1 POINT (41144.95 35890.2) 0.7581380 12.79041
2 POINT (41144.95 35890.2) 0.7581380 12.79041
3 POINT (19216.72 39896.49) 0.3276085 14.92359
4 POINT (19216.72 39896.49) 0.3276085 14.92359
5 POINT (36064.38 42301.51) 0.7017365 14.07630
6 POINT (36064.38 42301.51) 0.7017365 14.07630
distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1 0.9245228 0.2710456 0.33254769 0.2564781
2 0.9245228 0.2710456 0.33254769 0.2564781
3 1.3337754 0.3907374 0.08537721 0.3419659
4 1.3337754 0.3907374 0.08537721 0.3419659
5 1.0403345 0.7480958 0.03602671 0.8280045
6 1.0403345 0.7480958 0.03602671 0.8280045
preschools_within_400m schools_within_400m total_remaining_lease_mths
1 5 1 771
2 5 1 757
3 7 0 779
4 7 0 764
5 10 1 928
6 10 1 927
years_since_lease_commence storey_min storey_max median_storey
1 36 1 3 2
2 36 1 3 2
3 35 1 3 2
4 35 1 3 2
5 22 1 3 2
6 22 16 18 17
library(ggplot2)
# Distance to nearest eldercare
ggplot(joined_data, aes(x = distance_to_eldercare)) +
geom_density(fill = "skyblue", alpha = 0.5) +
labs(title = "Density of Distance to Nearest Eldercare",
x = "Distance to Eldercare (meters)", y = "Density")
# Distance to nearest CBD
ggplot(joined_data, aes(x = distance_to_cbd)) +
geom_density(fill = "skyblue", alpha = 0.5) +
labs(title = "Density of Distance to Nearest CBD",
x = "Distance to CBD (meters)", y = "Density")
# Distance to nearest hawker center
ggplot(joined_data, aes(x = distance_to_hawker)) +
geom_density(fill = "lightgreen", alpha = 0.5) +
labs(title = "Density of Distance to Nearest Hawker",
x = "Distance to Hawker (meters)", y = "Density")
# Distance to nearest mall
ggplot(joined_data, aes(x = distance_to_mall)) +
geom_density(fill = "lightgreen", alpha = 0.5) +
labs(title = "Density of Distance to Nearest Mall",
x = "Distance to Mall (meters)", y = "Density")
# Distance to nearest MRT station
ggplot(joined_data, aes(x = distance_to_mrt)) +
geom_density(fill = "coral", alpha = 0.5) +
labs(title = "Density of Distance to Nearest MRT Station",
x = "Distance to MRT Station (meters)", y = "Density")
# Distance to nearest park
ggplot(joined_data, aes(x = distance_to_parks)) +
geom_density(fill = "coral", alpha = 0.5) +
labs(title = "Density of Distance to Nearest Park",
x = "Distance to Park (meters)", y = "Density")
# Number of preschools within 500m
ggplot(joined_data, aes(x = preschools_within_400m)) +
geom_histogram(binwidth = 1, fill = "purple", color = "black") +
labs(title = "Histogram of Preschools within 500m",
x = "Number of Preschools within 500m", y = "Frequency")
# Number of schools within 500m
ggplot(joined_data, aes(x = schools_within_400m)) +
geom_histogram(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Schools within 500m",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = median_storey)) +
geom_histogram(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Median Storey",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = total_remaining_lease_mths)) +
geom_histogram(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Total Remaining Lease Months",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = years_since_lease_commence)) +
geom_histogram(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Years since lease commence",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = floor_area_sqm)) +
geom_histogram(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of floor area sqm",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = resale_price)) +
geom_density(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of resale price",
x = "Number of Schools within 500m", y = "Frequency")
Since several variables did not appear to follow a normal distribution, a log transformation was applied to make them more suitable for analysis in the GWR model. Adding 1 to each variable prevents issues with zero values, ensuring valid log transformations. The resulting variables include log-transformed versions of distance measures, nearby amenity counts, median storey, remaining lease, years since lease commencement, and floor area. This transformation helps to stabilize variance, normalize distributions, and potentially enhance model interpretability and performance.
library(dplyr)
# Log-transform all distance and count variables, adding +1 to avoid log(0) issues
joined_data <- joined_data %>%
mutate(
log_distance_to_eldercare = log(distance_to_eldercare + 1),
log_distance_to_cbd = log(distance_to_cbd + 1),
log_distance_to_hawker = log(distance_to_hawker + 1),
log_distance_to_mall = log(distance_to_mall + 1),
log_distance_to_mrt = log(distance_to_mrt + 1),
log_distance_to_parks = log(distance_to_parks + 1),
log_preschools_within_400m = log(preschools_within_400m + 1),
log_schools_within_400m = log(schools_within_400m + 1),
log_median_storey = log(median_storey + 1),
log_total_remaining_lease = log(total_remaining_lease_mths + 1),
log_years_since_lease_commence = log(years_since_lease_commence + 1),
log_floor_area_sqm = log(floor_area_sqm + 1)
)library(ggplot2)
# Density plots for log-transformed distance variables
ggplot(joined_data, aes(x = log_distance_to_eldercare)) +
geom_density(fill = "skyblue", alpha = 0.5) +
labs(title = "Log Density of Distance to Nearest Eldercare",
x = "Log Distance to Eldercare", y = "Density")
ggplot(joined_data, aes(x = log_distance_to_cbd)) +
geom_density(fill = "skyblue", alpha = 0.5) +
labs(title = "Log Density of Distance to Nearest CBD",
x = "Log Distance to CBD", y = "Density")
ggplot(joined_data, aes(x = log_distance_to_hawker)) +
geom_density(fill = "lightgreen", alpha = 0.5) +
labs(title = "Log Density of Distance to Nearest Hawker",
x = "Log Distance to Hawker", y = "Density")
ggplot(joined_data, aes(x = log_distance_to_mall)) +
geom_density(fill = "lightgreen", alpha = 0.5) +
labs(title = "Log Density of Distance to Nearest Mall",
x = "Log Distance to Mall", y = "Density")
ggplot(joined_data, aes(x = log_distance_to_mrt)) +
geom_density(fill = "coral", alpha = 0.5) +
labs(title = "Log Density of Distance to Nearest MRT Station",
x = "Log Distance to MRT Station", y = "Density")
ggplot(joined_data, aes(x = log_distance_to_parks)) +
geom_density(fill = "coral", alpha = 0.5) +
labs(title = "Log Density of Distance to Nearest Park",
x = "Log Distance to Park", y = "Density")
# Histograms for log-transformed count variables
ggplot(joined_data, aes(x = log_preschools_within_400m)) +
geom_histogram(binwidth = 0.1, fill = "purple", color = "black") +
labs(title = "Log Histogram of Preschools within 500m",
x = "Log Number of Preschools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = log_schools_within_400m)) +
geom_histogram(binwidth = 0.1, fill = "orange", color = "black") +
labs(title = "Log Histogram of Schools within 500m",
x = "Log Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = log_median_storey)) +
geom_histogram(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Median Storey",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = log_total_remaining_lease)) +
geom_density(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Total Remaining Lease Months",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = log_years_since_lease_commence)) +
geom_density(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Years Since Lease Commence",
x = "Number of Schools within 500m", y = "Frequency")
ggplot(joined_data, aes(x = log_floor_area_sqm)) +
geom_density(binwidth = 1, fill = "orange", color = "black") +
labs(title = "Histogram of Floor Area sqm",
x = "Number of Schools within 500m", y = "Frequency")
library(dplyr)
library(sf) # for st_drop_geometry
library(corrplot)
# Step 1: Drop the geometry column
numeric_data <- joined_data %>%
st_drop_geometry() %>% # Remove the geometry column
select(
log_distance_to_eldercare, log_distance_to_cbd, log_distance_to_hawker,
log_distance_to_mall, log_distance_to_mrt, log_distance_to_parks,
log_preschools_within_400m, log_schools_within_400m,
log_median_storey,log_total_remaining_lease, resale_price, log_years_since_lease_commence,log_floor_area_sqm
)
# Step 2: Compute the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs") # Use "complete.obs" to handle missing data
# Print the correlation matrix
print(cor_matrix) log_distance_to_eldercare log_distance_to_cbd
log_distance_to_eldercare 1.000000000 0.30118662
log_distance_to_cbd 0.301186622 1.00000000
log_distance_to_hawker 0.111039508 0.34659052
log_distance_to_mall 0.009091574 -0.17058707
log_distance_to_mrt 0.078492235 0.03578689
log_distance_to_parks -0.017131482 0.37375049
log_preschools_within_400m 0.023055243 0.11970527
log_schools_within_400m 0.013676719 0.09870777
log_median_storey -0.094844163 -0.19613562
log_total_remaining_lease 0.079828286 0.24744941
resale_price -0.235732333 -0.64274729
log_years_since_lease_commence -0.029641406 -0.18218299
log_floor_area_sqm -0.061250887 -0.08959270
log_distance_to_hawker log_distance_to_mall
log_distance_to_eldercare 0.11103951 0.0090915740
log_distance_to_cbd 0.34659052 -0.1705870741
log_distance_to_hawker 1.00000000 -0.1627287931
log_distance_to_mall -0.16272879 1.0000000000
log_distance_to_mrt -0.04772654 0.3227345062
log_distance_to_parks 0.11737431 -0.0004525176
log_preschools_within_400m 0.08556389 -0.2903878034
log_schools_within_400m 0.01876348 -0.0989959691
log_median_storey -0.12826378 -0.0240286633
log_total_remaining_lease 0.09655721 -0.1667559150
resale_price -0.31634902 0.0343916380
log_years_since_lease_commence -0.04537941 0.1022365068
log_floor_area_sqm 0.00454388 0.1757466027
log_distance_to_mrt log_distance_to_parks
log_distance_to_eldercare 0.078492235 -0.0171314819
log_distance_to_cbd 0.035786885 0.3737504940
log_distance_to_hawker -0.047726540 0.1173743134
log_distance_to_mall 0.322734506 -0.0004525176
log_distance_to_mrt 1.000000000 0.1100091533
log_distance_to_parks 0.110009153 1.0000000000
log_preschools_within_400m -0.214465568 0.0313474403
log_schools_within_400m -0.217188092 0.0420952613
log_median_storey -0.057106989 -0.0733363997
log_total_remaining_lease -0.005694969 0.1992314719
resale_price -0.069614079 -0.2118702786
log_years_since_lease_commence -0.076149471 -0.1950705844
log_floor_area_sqm 0.179682293 -0.0341265725
log_preschools_within_400m
log_distance_to_eldercare 0.023055243
log_distance_to_cbd 0.119705265
log_distance_to_hawker 0.085563887
log_distance_to_mall -0.290387803
log_distance_to_mrt -0.214465568
log_distance_to_parks 0.031347440
log_preschools_within_400m 1.000000000
log_schools_within_400m 0.279534178
log_median_storey -0.046616068
log_total_remaining_lease 0.004644186
resale_price -0.084728010
log_years_since_lease_commence 0.091914221
log_floor_area_sqm -0.017896573
log_schools_within_400m log_median_storey
log_distance_to_eldercare 0.01367672 -0.09484416
log_distance_to_cbd 0.09870777 -0.19613562
log_distance_to_hawker 0.01876348 -0.12826378
log_distance_to_mall -0.09899597 -0.02402866
log_distance_to_mrt -0.21718809 -0.05710699
log_distance_to_parks 0.04209526 -0.07333640
log_preschools_within_400m 0.27953418 -0.04661607
log_schools_within_400m 1.00000000 -0.01690392
log_median_storey -0.01690392 1.00000000
log_total_remaining_lease -0.25783108 0.11000803
resale_price -0.21226633 0.37665733
log_years_since_lease_commence 0.36913042 -0.10850119
log_floor_area_sqm 0.06575264 -0.16608588
log_total_remaining_lease resale_price
log_distance_to_eldercare 0.079828286 -0.23573233
log_distance_to_cbd 0.247449415 -0.64274729
log_distance_to_hawker 0.096557208 -0.31634902
log_distance_to_mall -0.166755915 0.03439164
log_distance_to_mrt -0.005694969 -0.06961408
log_distance_to_parks 0.199231472 -0.21187028
log_preschools_within_400m 0.004644186 -0.08472801
log_schools_within_400m -0.257831083 -0.21226633
log_median_storey 0.110008033 0.37665733
log_total_remaining_lease 1.000000000 0.22722493
resale_price 0.227224926 1.00000000
log_years_since_lease_commence -0.940797181 -0.29619826
log_floor_area_sqm -0.624374748 -0.02192839
log_years_since_lease_commence
log_distance_to_eldercare -0.02964141
log_distance_to_cbd -0.18218299
log_distance_to_hawker -0.04537941
log_distance_to_mall 0.10223651
log_distance_to_mrt -0.07614947
log_distance_to_parks -0.19507058
log_preschools_within_400m 0.09191422
log_schools_within_400m 0.36913042
log_median_storey -0.10850119
log_total_remaining_lease -0.94079718
resale_price -0.29619826
log_years_since_lease_commence 1.00000000
log_floor_area_sqm 0.53335781
log_floor_area_sqm
log_distance_to_eldercare -0.06125089
log_distance_to_cbd -0.08959270
log_distance_to_hawker 0.00454388
log_distance_to_mall 0.17574660
log_distance_to_mrt 0.17968229
log_distance_to_parks -0.03412657
log_preschools_within_400m -0.01789657
log_schools_within_400m 0.06575264
log_median_storey -0.16608588
log_total_remaining_lease -0.62437475
resale_price -0.02192839
log_years_since_lease_commence 0.53335781
log_floor_area_sqm 1.00000000
# Step 3: Visualize the correlation matrix (optional)
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, title = "Correlation Matrix")
This code calculates and visualizes the correlation matrix for log-transformed and numeric variables in the joined_data dataset, excluding the spatial geometry. Missing values in numeric columns are imputed with the median to ensure completeness. The correlation matrix helps identify relationships among variables for further analysis.
In the correlation matrix, the variables log_total_remaining_lease and log_floor_area_sqm show a strong negative correlation. This suggests that as the total remaining lease (log-transformed) decreases, the floor area (log-transformed) tends to increase, or vice versa. This relationship could reflect a trend where older properties, which generally have less remaining lease, might also tend to have larger floor areas.
library(dplyr)
# Impute missing values in numerical columns with the median
joined_data <- joined_data %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))The dataset is split into training and testing sets with an 80/20 proportion, where 80% of the data is used for training the model, and 20% is reserved for testing its performance. This ensures that the model is trained on a substantial portion of the data while retaining enough data for evaluating its predictive accuracy.
This linear regression model (price_mlr) aims to predict HDB resale prices using log-transformed distance variables and property characteristics as predictors. The model shows that several factors—such as distance to the CBD, MRT, parks, the number of nearby preschools and schools, and the median storey—are significant predictors of resale prices. With an adjusted R-squared of 0.6283, the model explains approximately 63% of the variance in resale prices, indicating a moderately strong fit.
set.seed(1234)
resale_split <- initial_split(joined_data,
prop = 8/10,)
train_data <- training(resale_split)
test_data <- testing(resale_split)# Correct model formula with + between variables
price_mlr <- lm(
resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m + log_median_storey + log_years_since_lease_commence,
data = train_data
)
# Check the model summary
summary(price_mlr)
Call:
lm(formula = resale_price ~ log_distance_to_eldercare + log_distance_to_cbd +
log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt +
log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m +
log_median_storey + log_years_since_lease_commence, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-283194 -62333 -10150 51438 423761
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1613614 29354 54.971 < 2e-16 ***
log_distance_to_eldercare -17138 8247 -2.078 0.037865 *
log_distance_to_cbd -258477 7718 -33.489 < 2e-16 ***
log_distance_to_hawker -53686 9209 -5.830 6.68e-09 ***
log_distance_to_mall -3401 13078 -0.260 0.794866
log_distance_to_mrt -41930 10881 -3.854 0.000121 ***
log_distance_to_parks -17405 10667 -1.632 0.102950
log_preschools_within_400m 12705 6231 2.039 0.041612 *
log_schools_within_400m -8768 5743 -1.527 0.127042
log_median_storey 48225 3974 12.135 < 2e-16 ***
log_years_since_lease_commence -95171 3928 -24.228 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 92660 on 1630 degrees of freedom
Multiple R-squared: 0.642, Adjusted R-squared: 0.6399
F-statistic: 292.4 on 10 and 1630 DF, p-value: < 2.2e-16
This code uses the Variance Inflation Factor (VIF) to check for multicollinearity in the global regression model predicting resale_price. The VIF values for each predictor in global_model will help identify variables with high collinearity (typically indicated by VIF values greater than 5 or 10). High VIF values suggest that some predictors may be strongly correlated with others, which could impact the model’s stability and interpretability. Reducing multicollinearity may involve removing or combining highly correlated variables.
train_data_sp <- as_Spatial(train_data)
train_data_spclass : SpatialPointsDataFrame
features : 1641
extent : 11755.72, 42441.24, 28457.97, 48339.17 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 40
names : address, postal, month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price, remaining_lease_yr, remaining_lease_mth, ...
min values : 101 SIMEI ST 1, 080109, 2023-01, ANG MO KIO, 5 ROOM, 101, ADMIRALTY DR, 01 TO 03, 104, 3Gen, 1974, 48 years 11 months, 470000, 48, 1, ...
max values : 997C BUANGKOK CRES, 824662, 2024-09, YISHUN, 5 ROOM, 997C, YUNG KUANG RD, 46 TO 48, 150, Type S2, 2020, 95 years 05 months, 1588000, 95, 11, ...
library(car)
# Run VIF on the global model to check for multicollinearity
global_model <- lm(resale_price ~ log_distance_to_eldercare + log_distance_to_cbd +
log_distance_to_hawker + log_distance_to_mall +
log_distance_to_mrt + log_distance_to_parks +
log_preschools_within_400m + log_schools_within_400m +
log_median_storey + log_years_since_lease_commence +
log_floor_area_sqm,
data = train_data_sp)
vif(global_model) log_distance_to_eldercare log_distance_to_cbd
1.153131 1.574228
log_distance_to_hawker log_distance_to_mall
1.158551 1.283349
log_distance_to_mrt log_distance_to_parks
1.287822 1.240598
log_preschools_within_400m log_schools_within_400m
1.200550 1.336858
log_median_storey log_years_since_lease_commence
1.091242 1.821248
log_floor_area_sqm
1.582944
Based on these VIF results, there is no indication of multicollinearity among the predictors, as all VIF values are below 2. This suggests that none of the variables are excessively correlated with each other, supporting the model’s stability and interpretability.
This code determines an optimal adaptive bandwidth (bw_adaptive) for a Geographically Weighted Regression (GWR) model that predicts resale_price using various log-transformed distance and property variables. Using cross-validation (approach = "CV") with a Gaussian kernel, the code adjusts the bandwidth based on local data density, which allows the model to account for spatial variation more accurately. The longlat = FALSE parameter specifies planar distance calculations, aligning with Singapore’s SVY21 coordinate system. This adaptive bandwidth helps capture local effects more effectively across different data densities within the region.
bw_adaptive <- bw.gwr(
resale_price ~ log_distance_to_eldercare + log_distance_to_cbd +
log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt +
log_distance_to_parks + log_preschools_within_400m +
log_schools_within_400m + log_median_storey +
log_years_since_lease_commence + log_floor_area_sqm,
data = train_data_sp,
approach = "CV",
kernel = "gaussian",
adaptive = TRUE,
longlat = FALSE
)Take a cup of tea and have a break, it will take a few minutes.
-----A kind suggestion from GWmodel development group
Adaptive bandwidth: 1021 CV score: 1.140465e+13
Adaptive bandwidth: 639 CV score: 1.074045e+13
Adaptive bandwidth: 401 CV score: 9.800083e+12
Adaptive bandwidth: 256 CV score: 9.064662e+12
Adaptive bandwidth: 164 CV score: 8.295959e+12
Adaptive bandwidth: 109 CV score: 6.803495e+12
Adaptive bandwidth: 73 CV score: 5.696787e+12
Adaptive bandwidth: 53 CV score: 4.968459e+12
Adaptive bandwidth: 38 CV score: 9.272527e+14
Adaptive bandwidth: 59 CV score: 5.217215e+12
Adaptive bandwidth: 45 CV score: 4.692108e+12
Adaptive bandwidth: 44 CV score: 4.625041e+12
Adaptive bandwidth: 39 CV score: 1.262828e+15
Adaptive bandwidth: 42 CV score: 4.60572e+12
Adaptive bandwidth: 46 CV score: 4.708241e+12
Adaptive bandwidth: 44 CV score: 4.625041e+12
Adaptive bandwidth: 45 CV score: 4.692108e+12
Adaptive bandwidth: 44 CV score: 4.625041e+12
Adaptive bandwidth: 44 CV score: 4.625041e+12
Adaptive bandwidth: 43 CV score: 4.604421e+12
Adaptive bandwidth: 41 CV score: 7.172298e+14
Adaptive bandwidth: 41 CV score: 7.172298e+14
Adaptive bandwidth: 40 CV score: 7.171396e+14
Adaptive bandwidth: 40 CV score: 7.171396e+14
Adaptive bandwidth: 39 CV score: 1.262828e+15
Adaptive bandwidth: 39 CV score: 1.262828e+15
Adaptive bandwidth: 38 CV score: 9.272527e+14
Adaptive bandwidth: 38 CV score: 9.272527e+14
Adaptive bandwidth: 37 CV score: 9.272069e+14
Adaptive bandwidth: 37 CV score: 9.272069e+14
Adaptive bandwidth: 36 CV score: 9.394113e+14
Adaptive bandwidth: 36 CV score: 9.394113e+14
Adaptive bandwidth: 35 CV score: 9.393867e+14
Adaptive bandwidth: 35 CV score: 9.393867e+14
Adaptive bandwidth: 34 CV score: 1.025499e+15
Adaptive bandwidth: 34 CV score: 1.025499e+15
This code runs an adaptive Geographically Weighted Regression (GWR) model (gwr_adaptive) to predict resale_price using various log-transformed distance and property characteristics. The model applies the optimal adaptive bandwidth (bw_adaptive) calculated previously, allowing it to vary based on data density. A Gaussian kernel is used to assign weights, and adaptive=TRUE enables the model to capture local spatial variations more accurately. With longlat = FALSE, distances are calculated in planar units, suitable for the SVY21 projection system. This GWR model helps reveal how the influence of each variable changes across different locations.
gwr_adaptive <- gwr.basic(formula = resale_price ~log_distance_to_eldercare + log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m + log_median_storey + log_years_since_lease_commence + log_floor_area_sqm,
data=train_data_sp,
bw=bw_adaptive,
kernel = 'gaussian',
adaptive=TRUE,
longlat = FALSE)This output summarizes the results of the Geographically Weighted Regression (GWR) model for predicting HDB resale prices.
Global Regression Results: The global model shows an adjusted R-squared of 0.6958, explaining around 70% of the variance in resale prices with significant predictors including distances to CBD, MRT stations, parks, nearby amenities, and property characteristics like storey level and lease age.
GWR Model Performance: The GWR model, with an adaptive bandwidth that uses 34 nearest neighbors, demonstrates improved performance with an R-squared of 0.9374, indicating that the model explains around 94% of the variance in resale prices when accounting for spatial variation. The diagnostic metrics (AICc, AIC, and BIC) suggest a better model fit compared to the global regression.
Spatial Variation in Coefficients: The GWR coefficients vary across locations, as shown in the summary with ranges for each predictor. This variation reflects the differing influence of factors like proximity to amenities and lease duration across different areas, highlighting the model’s ability to capture local effects.
gwr_adaptive ***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2024-11-09 19:49:51.140074
Call:
gwr.basic(formula = resale_price ~ log_distance_to_eldercare +
log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall +
log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m +
log_schools_within_400m + log_median_storey + log_years_since_lease_commence +
log_floor_area_sqm, data = train_data_sp, bw = bw_adaptive,
kernel = "gaussian", adaptive = TRUE, longlat = FALSE)
Dependent (y) variable: resale_price
Independent variables: log_distance_to_eldercare log_distance_to_cbd log_distance_to_hawker log_distance_to_mall log_distance_to_mrt log_distance_to_parks log_preschools_within_400m log_schools_within_400m log_median_storey log_years_since_lease_commence log_floor_area_sqm
Number of data points: 1641
***********************************************************************
* Results of Global Regression *
***********************************************************************
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-246438 -53643 -5081 41269 357502
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2361196 227104 -10.397 < 2e-16 ***
log_distance_to_eldercare -3085 7602 -0.406 0.68490
log_distance_to_cbd -257779 7075 -36.433 < 2e-16 ***
log_distance_to_hawker -60918 8452 -7.208 8.67e-13 ***
log_distance_to_mall -14074 12004 -1.172 0.24117
log_distance_to_mrt -79198 10196 -7.768 1.41e-14 ***
log_distance_to_parks -27029 9794 -2.760 0.00585 **
log_preschools_within_400m 11003 5713 1.926 0.05429 .
log_schools_within_400m 4203 5316 0.791 0.42928
log_median_storey 55758 3668 15.201 < 2e-16 ***
log_years_since_lease_commence -136177 4287 -31.764 < 2e-16 ***
log_floor_area_sqm 859580 48767 17.626 < 2e-16 ***
---Significance stars
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 84940 on 1629 degrees of freedom
Multiple R-squared: 0.6994
Adjusted R-squared: 0.6974
F-statistic: 344.5 on 11 and 1629 DF, p-value: < 2.2e-16
***Extra Diagnostic information
Residual sum of squares: 1.175429e+13
Sigma(hat): 84685.42
AIC: 41920.82
AICc: 41921.04
BIC: 40446.3
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Adaptive bandwidth: 43 (number of nearest neighbours)
Regression points: the same locations as observations are used.
Distance metric: Euclidean distance metric is used.
****************Summary of GWR coefficient estimates:******************
Min. 1st Qu. Median
Intercept -11436554.2 -3064686.5 -1217942.1
log_distance_to_eldercare -568010.3 -62367.5 -9860.7
log_distance_to_cbd -9438534.0 -506318.3 -206796.2
log_distance_to_hawker -240639.1 -99748.1 -44161.9
log_distance_to_mall -475424.0 -165282.9 -80837.9
log_distance_to_mrt -1335909.3 -149289.1 -100247.9
log_distance_to_parks -453622.8 -52820.3 517.6
log_preschools_within_400m -28983.4 5749.9 18729.3
log_schools_within_400m -112888.8 -25465.9 -9810.8
log_median_storey 28330.2 38574.1 47551.7
log_years_since_lease_commence -396205.3 -171516.1 -118938.1
log_floor_area_sqm -2247797.6 339575.5 578504.2
3rd Qu. Max.
Intercept 534245.7 22477664
log_distance_to_eldercare 38130.6 394258
log_distance_to_cbd -33750.5 3787337
log_distance_to_hawker 5332.0 443729
log_distance_to_mall -11268.9 1137156
log_distance_to_mrt -46246.5 233541
log_distance_to_parks 90287.4 1269492
log_preschools_within_400m 37297.2 96400
log_schools_within_400m 9748.4 673855
log_median_storey 59361.7 92068
log_years_since_lease_commence -85226.8 43027
log_floor_area_sqm 963719.8 2099297
************************Diagnostic information*************************
Number of data points: 1641
Effective number of parameters (2trace(S) - trace(S'S)): 244.048
Effective degrees of freedom (n-2trace(S) + trace(S'S)): 1396.952
AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 40365.14
AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 40106.68
BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 39746.41
Residual sum of squares: 3.499648e+12
R-square value: 0.9104959
Adjusted R-square value: 0.8948483
***********************************************************************
Program stops at: 2024-11-09 19:49:52.105098
test_data_sp <- test_data %>%
as_Spatial()
test_data_spclass : SpatialPointsDataFrame
features : 411
extent : 11755.72, 42441.24, 28501.49, 48154.92 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 40
names : address, postal, month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price, remaining_lease_yr, remaining_lease_mth, ...
min values : 101 SIMEI ST 1, 080109, 2023-01, ANG MO KIO, 5 ROOM, 101, ANCHORVALE CRES, 01 TO 03, 104, 3Gen, 1975, 50 years 07 months, 488000, 50, 1, ...
max values : 997C BUANGKOK CRES, 824662, 2024-09, YISHUN, 5 ROOM, 997C, YUNG KUANG RD, 46 TO 48, 139, Type S2, 2020, 95 years 01 month, 1450000, 95, 11, ...
This code performs Geographically Weighted Regression (GWR) predictions on a test dataset using an adaptive bandwidth.
Bandwidth Selection for Test Data:
gwr_bw_test_adaptiveis calculated using cross-validation on thetest_data_spdataset to identify an optimal adaptive bandwidth. This ensures that the GWR model adapts to local data densities within the test dataset.GWR Prediction: The
gwr.predictfunction applies the GWR model, using the previously computedbw_adaptivefrom the training data, and makes predictions ontest_data_sp. The model uses the same Gaussian kernel and adaptive bandwidth, allowing for spatially varying predictions that account for the influence of local variables.
This setup allows the model to generate spatially aware predictions on unseen data, leveraging the training data’s local patterns to estimate resale prices in the test dataset.
gwr_bw_test_adaptive <- bw.gwr(resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m + log_median_storey + log_years_since_lease_commence + log_floor_area_sqm,
data=test_data_sp,
approach="CV",
kernel="gaussian",
adaptive=TRUE,
longlat=FALSE)Adaptive bandwidth: 261 CV score: 2.708293e+12
Adaptive bandwidth: 169 CV score: 2.624083e+12
Adaptive bandwidth: 111 CV score: 2.506149e+12
Adaptive bandwidth: 76 CV score: 2.362042e+12
Adaptive bandwidth: 53 CV score: 2.209377e+12
Adaptive bandwidth: 40 CV score: 2.110797e+12
Adaptive bandwidth: 31 CV score: 1.984621e+12
Adaptive bandwidth: 26 CV score: 1.845974e+12
Adaptive bandwidth: 22 CV score: 1.722423e+12
Adaptive bandwidth: 20 CV score: 1.663859e+12
Adaptive bandwidth: 18 CV score: 1.64053e+12
Adaptive bandwidth: 18 CV score: 1.64053e+12
gwr_pred <- gwr.predict(
formula = resale_price ~ log_distance_to_eldercare + log_distance_to_cbd +
log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt +
log_distance_to_parks + log_preschools_within_400m +
log_schools_within_400m + log_median_storey +
log_years_since_lease_commence + log_floor_area_sqm,
data = train_data_sp,
predictdata = test_data_sp,
bw = bw_adaptive,
kernel = "gaussian",
adaptive = TRUE,
longlat = FALSE
)coords <- st_coordinates(joined_data)
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)train_data <- train_data %>%
st_drop_geometry()test_data <- cbind(test_data, coords_test) %>%
st_drop_geometry()head(gwr_pred$SDF@data) Intercept_coef log_distance_to_eldercare_coef log_distance_to_cbd_coef
1 -687237.8 39792.38 -29427.61
2 -1801476.0 118135.27 -276284.05
3 -1731721.5 -26067.61 -320298.78
4 -1731721.5 -26067.61 -320298.78
5 264182.9 32503.65 -129791.08
6 -3157231.8 -86252.99 221432.28
log_distance_to_hawker_coef log_distance_to_mall_coef
1 -77805.69 -62840.16
2 -56186.52 -125819.06
3 -70223.88 -33314.58
4 -70223.88 -33314.58
5 -40456.81 -91071.18
6 -207260.01 -164763.48
log_distance_to_mrt_coef log_distance_to_parks_coef
1 -182929.483 -68880.62
2 -9570.946 -102371.54
3 -140599.491 -30990.31
4 -140599.491 -30990.31
5 -182282.183 -50060.30
6 -95812.948 97958.95
log_preschools_within_400m_coef log_schools_within_400m_coef
1 33679.05 -53124.014
2 32803.40 -21317.726
3 -20727.88 2265.544
4 -20727.88 2265.544
5 34923.72 -50718.916
6 37297.22 -25465.899
log_median_storey_coef log_years_since_lease_commence_coef
1 60244.01 -167736.15
2 59135.65 -91733.91
3 69387.92 -153224.61
4 69387.92 -153224.61
5 62339.57 -145775.69
6 37689.31 -137621.84
log_floor_area_sqm_coef prediction prediction_var
1 436231.9 727775.1 2880857279
2 712258.1 573442.2 2939356035
3 798510.9 680130.2 2598057240
4 798510.9 743709.7 2605157674
5 274358.9 599134.0 2795978724
6 781639.8 585460.8 2620203481
# Extract coefficients and test data as data frames
coefficients <- gwr_pred$SDF@data
test_data_df <- as.data.frame(test_data_sp)
# Calculate the predicted values manually
predicted_values <- coefficients$Intercept_coef +
coefficients$log_distance_to_eldercare_coef * test_data_df$log_distance_to_eldercare +
coefficients$log_distance_to_cbd_coef * test_data_df$log_distance_to_cbd +
coefficients$log_distance_to_hawker_coef * test_data_df$log_distance_to_hawker +
coefficients$log_distance_to_mall_coef * test_data_df$log_distance_to_mall +
coefficients$log_distance_to_mrt_coef * test_data_df$log_distance_to_mrt +
coefficients$log_distance_to_parks_coef * test_data_df$log_distance_to_parks +
coefficients$log_preschools_within_400m_coef * test_data_df$log_preschools_within_400m +
coefficients$log_schools_within_400m_coef * test_data_df$log_schools_within_400m +
coefficients$log_median_storey_coef * test_data_df$log_median_storey +
coefficients$log_years_since_lease_commence_coef * test_data_df$log_years_since_lease_commence +
coefficients$log_floor_area_sqm_coef * test_data_df$log_floor_area_sqm
# Create a comparison data frame
comparison_data <- data.frame(
Actual = test_data_sp$resale_price,
Predicted = predicted_values
)
# Plot the actual vs. predicted values
library(ggplot2)
ggplot(comparison_data, aes(x = Predicted, y = Actual)) +
geom_point(color = "black", alpha = 0.6) +
labs(title = "Actual vs Predicted Resale Prices",
x = "Predicted Resale Price",
y = "Actual Resale Price") +
theme_minimal()
library(ggplot2)
# Modify the x-axis limit, for example, setting it to a maximum of 1 million
ggplot(comparison_data, aes(x = Predicted, y = Actual)) +
geom_point(color = "black", alpha = 0.6) +
labs(title = "Actual vs Predicted Resale Prices",
x = "Predicted Resale Price",
y = "Actual Resale Price") +
xlim(0, 1e6) + # Set the x-axis limit to a maximum of 1 million
theme_minimal()